Abstract

Twitter is often used in quantitative studies that identify
geographically-preferred topics, writing styles, and entities.
These studies rely on either GPS coordinates attached to
individual messages, or on the user-supplied location field in
each profile.

In this paper, we compare these data acquisition techniques and
quantify the biases that they introduce; we also measure their
effects on linguistic analysis and text-based geolocation.
GPS-tagging and self-reported locations yield measurably
different corpora, and these linguistic differences are
partially attributable to differences in dataset composition by
age and gender. Using a latent variable model to induce age and
gender, we show how these demographic variables interact with
geography to affect language use. We also show that the accuracy
of text-based geolocation varies with population demographics,
giving the best results for men above the age of 40.

1 Introduction 

Social media data such as Twitter is frequently used to identify
the unique characteristics of geographical regions, including
topics of interest (Hong et al., 2012), linguistic styles and
dialects (Eisenstein et al., 2010; Gonçalves and Sánchez, 2014),
political opinions (Caldarelli et al., 2014), and public health
(Broniatowski et al., 2013). Social media permits the
aggregation of datasets that are orders of magnitude larger than
could be assembled via traditional survey techniques, enabling
analysis that is simultaneously fine-grained and global in
scale. Yet social media is not a representative sample of any
“real world” population, aside from social media itself. Using
social media as a sample therefore risks introducing both
geographic and demographic biases (Mislove et al., 2011; Hecht
and Stephens, 2014; Longley et al., 2015; Malik et al., 2015).

This paper examines the effects of these biases on the
geo-linguistic inferences that can be drawn from Twitter. We
focus on the ten largest metropolitan areas in the United
States, and consider three sampling techniques: drawing an equal
number of GPS-tagged tweets from each area; drawing a
county-balanced sample of GPS-tagged messages to correct
Twitter’s urban skew (Hecht and Stephens, 2014); and drawing a
sample of location-annotated messages, using the location field
in the user profile. Leveraging self-reported first names and
census statistics, we show that the age and gender composition
of these datasets differs significantly.

Next, we apply standard methods from the literature to identify
geo-linguistic differences, and test how the outcomes of these
methods depend on the sampling technique and on the underlying
demographics. We also test the accuracy of text-based
geolocation (Cheng et al., 2010; Eisenstein et al., 2010) in
each dataset, to determine whether the accuracies reported in
recent work will generalize to more balanced samples.

The paper reports several new findings about 
geotagged Twitter data: 

• In comparison with tweets with self-reported 
locations, GPS-tagged tweets are written 
more often by young people and by women. 

• There are corresponding linguistic differences between these
datasets, with GPS-tagged tweets including more
geographically-specific non-standard words.

• Young people use significantly more geographically-specific
non-standard words. Men tend to mention more
geographically-specific entities than women, but these
differences are significant only for individuals at the age of
30 or older.

• Users who GPS-tag their tweets tend to write more, making them
easier to geolocate. Evaluating text-based geolocation on
GPS-tagged tweets probably overestimates its accuracy.

• Text-based geolocation is significantly more 
accurate for men and for older people. 

These findings should inform future attempts to generalize from
geotagged Twitter data, and may suggest investigations into the
demographic properties of other social media sites.

We first describe the basic data collection principles that hold
throughout the paper (§ 2). The following three sections tackle
demographic biases (§ 3), their linguistic consequences (§ 4),
and the impact on text-based geolocation (§ 5); each of these
sections begins with a discussion of methods, and then presents
results. We then summarize related work and conclude.

2 Dataset 

This study is performed on a dataset of tweets gathered from
Twitter’s streaming API from February 2014 to January 2015.
During an initial filtering step we removed retweets, that is,
repetitions of previously posted messages, which we identified
by the “retweeted_status” metadata field or by the “RT” token
that is widely used among Twitter users to indicate a retweet.
To eliminate spam and automated accounts (Yardi et al., 2009),
we removed tweets containing URLs, user accounts with more than
1000 followers or followees, accounts which had tweeted more
than 5000 messages at the time of data collection, and the top
10% of accounts based on number of messages in our dataset. We
also removed users who have written more than 10% of their
tweets in any language other than English, using Twitter’s lang
metadata field. Exploration of code-switching (Solorio and Liu,
2008) and the role of second-language English speakers (Eleta
and Golbeck, 2014) is left for future work.
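The filtering rules above can be sketched roughly as follows. This is only an illustration, not the code used in the study: it assumes tweets arrive as dictionaries shaped like Twitter's JSON payloads (`retweeted_status` and `text` on the tweet; `followers_count`, `friends_count`, and `statuses_count` on the user), and the helper names `keep_tweet` and `keep_user` are hypothetical. The top-10% volume cutoff and the language filter are omitted for brevity.

```python
import re

# Patterns for the two message-level filters that depend on the text.
URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"\bRT\b")  # the conventional retweet token


def keep_tweet(tweet: dict) -> bool:
    """Return True if a tweet survives the retweet and URL filters."""
    if "retweeted_status" in tweet:  # retweet flagged in the metadata
        return False
    text = tweet.get("text", "")
    if RT_RE.search(text):  # retweet marked manually with "RT"
        return False
    if URL_RE.search(text):  # tweets containing URLs are dropped
        return False
    return True


def keep_user(user: dict) -> bool:
    """Return True if an account passes the follower and volume thresholds."""
    if user.get("followers_count", 0) > 1000:
        return False
    if user.get("friends_count", 0) > 1000:  # "followees"
        return False
    if user.get("statuses_count", 0) > 5000:
        return False
    return True
```

Keeping the message-level and account-level checks separate mirrors the description above, where some rules remove individual tweets and others remove whole accounts.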

We consider the ten largest Metropolitan Statistical Areas
(MSAs) in the United States, listed in Table 1. MSAs are defined
by the U.S. Census Bureau as geographical regions of high
population density organized around a single urban core; they
are not legal administrative divisions. MSAs include outlying
areas that may be substantially less urban than the core itself.
For example, the Atlanta MSA is centered on Fulton County (1750
people per square mile), but extends to Haralson County (100
people per square mile), on the border of Alabama. A per-county
analysis of this data therefore enables us to assess the degree
to which Twitter’s skew towards urban areas biases
geo-linguistic analysis.

3 Representativeness of geotagged 
Twitter data 

We first assess potential biases in sampling techniques for
obtaining geotagged Twitter data. In particular, we compare two
possible techniques for obtaining data: the location field in
the user profile (Poblete et al., 2011; Dredze et al., 2013),
and the GPS coordinates attached to each message (Cheng et al.,
2010; Eisenstein et al., 2010).

3.1 Methods 

To build a dataset of GPS-tagged messages, we extracted the GPS
latitude and longitude coordinates reported in the tweet, and
used GIS-TOOLS¹ reverse geocoding to identify the corresponding
counties. This set of geotagged messages will be denoted
$\mathcal{D}_G$. Only 1.24% of messages contain geo-coordinates,
and it is possible that the individuals willing to share their
GPS coordinates comprise a skewed population. We therefore also
considered the user-reported location field in the Twitter
profile, focusing on the two most widely-used patterns: (1) city
name, (2) city name and two-letter state abbreviation (e.g.
Chicago and Chicago, IL). Messages that matched any of the ten
largest MSAs were grouped into a second set, $\mathcal{D}_L$.
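As a rough illustration of the two profile patterns, the sketch below matches a location string against "City" or "City, ST". The lookup table `MSA_CITIES` and the function `match_msa` are hypothetical, only two cities are listed, and the matching rules actually used in the study may differ.

```python
import re
from typing import Optional

# Hypothetical lookup: lower-cased city name -> (display name, state code).
# Only two of the ten MSAs are listed, to keep the sketch short.
MSA_CITIES = {
    "chicago": ("Chicago", "IL"),
    "atlanta": ("Atlanta", "GA"),
}

# "City" optionally followed by ", ST" (a two-letter state code).
LOC_RE = re.compile(r"^\s*([A-Za-z .]+?)\s*(?:,\s*([A-Za-z]{2}))?\s*$")


def match_msa(location: str) -> Optional[str]:
    """Return the matched city name, or None if the profile field does not match."""
    m = LOC_RE.match(location)
    if m is None:
        return None
    city, state = m.group(1).lower(), m.group(2)
    entry = MSA_CITIES.get(city)
    if entry is None:
        return None  # non-standard toponyms such as "ATL" end up here
    name, expected_state = entry
    if state is not None and state.upper() != expected_state:
        return None  # same city name, different state
    return name
```

Strings such as "ATL" or "Pixburgh" fall through to `None`, which corresponds to the false negatives discussed in the text.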

While the inconsistencies of writing style in the Twitter
location field are well-known (Hecht et al., 2011), analysis of
the intersection between $\mathcal{D}_G$ and $\mathcal{D}_L$
found that the two data sources agreed the overwhelming majority
of the time, suggesting that most self-provided locations are
accurate. Of course, there may be many false negatives: profiles
that we fail to geolocate due to the use of non-standard
toponyms like Pixburgh and ATL. If so, this would introduce a
bias in the population sample in $\mathcal{D}_L$. Such a bias
might have linguistic consequences, with datasets based on the
location field containing less non-standard language overall.

¹ https://github.com/DrSkippy/Data-Science-45min-Intros/blob/master/gis-tools-101/gis_tools.ipynb





Figure 1: Proportion of census population, Twitter messages, and Twitter user accounts, by county. New 
York is shown on the left, Atlanta on the right. 


3.1.1 Subsampling 

The initial samples $\mathcal{D}_G$ and $\mathcal{D}_L$ were
then resampled to create the following balanced datasets:

GPS-MSA-BALANCED From $\mathcal{D}_G$, we randomly sampled
25,000 tweets per MSA as the message-balanced sample, and all
the tweets from 2,500 users per MSA as the user-balanced sample.
Balancing across MSAs ensures that the largest MSAs do not
dominate the linguistic analysis.

GPS-COUNTY-BALANCED We resampled $\mathcal{D}_G$ based on
county-level population (obtained from the U.S. Census Bureau),
and again obtained message-balanced and user-balanced samples.
These samples are more geographically representative of the
overall population distribution across each MSA.

LOC-MSA-BALANCED From $\mathcal{D}_L$, we randomly sampled
25,000 tweets per MSA as the message-balanced sample, and all
the tweets from 2,500 users per MSA as the user-balanced sample.
It is not possible to obtain county-level geolocations in
$\mathcal{D}_L$, as exact geographical coordinates are
unavailable.
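The message-balanced and user-balanced resampling described above might be sketched as follows. The record layout (dictionaries with hypothetical `msa` and `user` keys) and both function names are assumptions made for illustration; county-level reweighting is not shown.

```python
import random
from collections import defaultdict


def message_balanced(tweets, per_msa, seed=0):
    """Draw up to `per_msa` tweets uniformly at random from each MSA."""
    rng = random.Random(seed)
    by_msa = defaultdict(list)
    for t in tweets:
        by_msa[t["msa"]].append(t)
    sample = []
    for msa in sorted(by_msa):
        group = by_msa[msa]
        sample.extend(rng.sample(group, min(per_msa, len(group))))
    return sample


def user_balanced(tweets, users_per_msa, seed=0):
    """Pick up to `users_per_msa` users per MSA and keep all of their tweets."""
    rng = random.Random(seed)
    users_by_msa = defaultdict(set)
    for t in tweets:
        users_by_msa[t["msa"]].add(t["user"])
    chosen = set()
    for msa in sorted(users_by_msa):
        users = sorted(users_by_msa[msa])  # sorted for reproducibility
        picked = rng.sample(users, min(users_per_msa, len(users)))
        chosen.update((msa, u) for u in picked)
    return [t for t in tweets if (t["msa"], t["user"]) in chosen]
```

In the setting described above, `per_msa` would be 25,000 and `users_per_msa` would be 2,500.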

3.1.2 Age and gender identification 

To estimate the distribution of ages and genders in each sample,
we queried statistics from the Social Security Administration,
which records the number of individuals born each year with each
given name. Using this information, we obtained the probability
distribution of age values for each given name. We then matched
the names against the first token in the name field of each
user’s profile, enabling us to induce approximate distributions
over ages and genders. Unlike Facebook and Google+, Twitter does
not have a “real name” policy, so users are free to give names
that are fake, humorous, etc. We eliminate user accounts whose
names are not sufficiently common in the social security
database (i.e. first names which are at least 100 times more
frequent in Twitter than in the social security database),
thereby omitting 33% of user accounts, and 34% of tweets.
While some individuals will choose names not typically
associated with their gender, we assume that this will happen
with roughly equal probability in both directions. So, with
these caveats in mind, we induce the age distribution
$p_{\mathcal{D}}(a)$ for the GPS-MSA-BALANCED sample and the
LOC-MSA-BALANCED sample by aggregating the name-conditional age
distributions over the users in each sample.
We induce distributions over author gender in much the same way
(Mislove et al., 2011). This method does not incorporate prior
information about the ages of Twitter users, and thus assigns
too much probability to the extremely young and old, who are
unlikely to use the service. While it would be easy to design
such a prior (for example, assigning zero prior probability to
users under the age of five or above the age of 95), we see no
principled basis for determining these cutoffs. We therefore
focus on the differences between the estimated
$p_{\mathcal{D}}(a)$ for each sample $\mathcal{D}$.
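One plausible reading of the name-based procedure is sketched below: build an age distribution for each given name from birth counts, then average those distributions over the users in a sample. The input layout (`births` keyed by name and birth year) and both function names are hypothetical, averaging is an assumption rather than the paper's stated formula, and the sketch ignores mortality.

```python
from collections import defaultdict


def age_given_name(births, current_year):
    """births: {(name, birth_year): count} -> {name: {age: probability}}."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for (name, year), count in births.items():
        counts[name][current_year - year] += count
        totals[name] += count
    return {
        name: {age: c / totals[name] for age, c in ages.items()}
        for name, ages in counts.items()
    }


def sample_age_distribution(first_names, p_age_given_name):
    """Average the per-name age distributions over the users in a sample.

    Users whose first name is missing from the table are skipped,
    mirroring the removal of uncommon names described in the text.
    """
    total = defaultdict(float)
    matched = 0
    for name in first_names:
        dist = p_age_given_name.get(name.lower())
        if dist is None:
            continue
        matched += 1
        for age, p in dist.items():
            total[age] += p
    if matched == 0:
        return {}
    return {age: v / matched for age, v in total.items()}
```

Because no prior over the ages of Twitter users is applied, such an estimate inherits the over-weighting of very young and very old ages noted above, so only differences between samples are meaningful.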




MSA              Num.       L1 Dist.       L1 Dist.
                 Counties   Population     Population
                            vs. Users      vs. Tweets
New York         23         0.2891         0.2825
Los Angeles      2          0.0203         0.0223
Chicago          14         0.0482         0.0535
Dallas           12         0.1437         0.1176
Houston          10         0.0394         0.0472
Philadelphia     11         0.1426         0.1202
Washington DC    22         0.2089         0.2750
Miami            3          0.0428         0.0362
Atlanta          28         0.1448         0.1730
Boston           7          0.1878         0.2303

Table 1: L1 distance between county-level population and Twitter
users and messages


3.2 Results 

Geographical biases in the GPS Sample We first assess the
differences between the true population distributions over
counties, and the per-tweet and per-user distributions. Because
counties vary widely in their degree of urbanization and other
demographic characteristics, this measure is a proxy for the
representativeness of GPS-based Twitter samples (county
information is not available for the LOC-MSA-BALANCED sample).
Population distributions for New York and Atlanta are shown in
Figure 1. In Atlanta, Fulton County is the most populous and
most urban, and is overrepresented in both geotagged tweets and
user accounts; most of the remaining counties are
correspondingly underrepresented. This coheres with the urban
bias noted earlier by Hecht and Stephens (2014). In New York,
Kings County (Brooklyn) is the most populous, but is
underrepresented in both the number of geotagged tweets and user
accounts, at the expense of New York County (Manhattan).
Manhattan is the commercial and entertainment center of the New
York MSA, so residents of outlying counties may be tweeting from
their jobs or social activities.

To quantify the representativeness of each sample, we use the L1
distance $\|p - t\|_1 = \sum_c |p_c - t_c|$, where $p_c$ is the
proportion of the MSA population residing in county $c$ and
$t_c$ is the proportion of tweets (Table 1). County boundaries
are determined by states, and their density varies: for example,
the Los Angeles MSA covers only two counties, while the smaller
Atlanta MSA is spread over 28 counties. The table shows that
while New York is the most extreme example, most MSAs feature an
asymmetry between county population and Twitter adoption.
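The L1 distance between the two county-level distributions can be computed as in the sketch below. The function name `l1_distance` and the dictionary inputs (raw counts keyed by county) are illustrative assumptions.

```python
def l1_distance(population, tweets):
    """L1 distance between two county-level distributions.

    Both arguments map county name -> raw count; each is normalized
    to proportions before the absolute differences are summed.
    """
    pop_total = sum(population.values())
    tweet_total = sum(tweets.values())
    counties = set(population) | set(tweets)
    return sum(
        abs(population.get(c, 0) / pop_total - tweets.get(c, 0) / tweet_total)
        for c in counties
    )
```

The distance is 0 when tweets are distributed across counties exactly in proportion to population, and it approaches 2 when the two distributions place their mass on disjoint counties.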


Figure 2: User counts by number of Twitter messages (x-axis:
number of messages by a user; y-axis: number of users in each
category, in units of 1e4)

Usage Next, we turn to differences between the GPS-based and
profile-based techniques for obtaining ground truth data. As
shown in Figure 2, the LOC-MSA-BALANCED sample contains more
low-volume users than either the GPS-MSA-BALANCED or
GPS-COUNTY-BALANCED samples. We can therefore conclude that the
county-level geographical bias in the GPS-based data does not
impact usage rate, but that the difference between GPS-based and
profile-based sampling does; the linguistic consequences of this
difference will be explored in the following sections.

Demographics Table 2 shows the expected age and gender for each
dataset, with bootstrap confidence intervals. Users in the
LOC-MSA-BALANCED dataset are on average two years older than in
the GPS-MSA-BALANCED and GPS-COUNTY-BALANCED datasets, which are
statistically indistinguishable. Focusing on the difference
between GPS-MSA-BALANCED and LOC-MSA-BALANCED, we plot the
difference in age probabilities in Figure 3, showing that
GPS-MSA-BALANCED includes many more teens and people in their
early twenties, while LOC-MSA-BALANCED includes more people at
middle age and older. Young people are especially likely to use
social media on cellphones (Lenhart, 2015), where location
tagging would be more relevant than when Twitter is accessed via
a personal computer. Social media users in the age brackets
18-29 and 30-49 are also more likely to tag their locations in
social media posts than social media users in the age brackets
50-64 and 65+ (Zickuhr, 2013), with women and men tagging at
roughly equal rates. Table 2 shows that the GPS-MSA-BALANCED and
GPS-COUNTY-BALANCED samples contain significantly more women
than LOC-MSA-BALANCED, though all three samples are close to
50%.

Sample                Expected Age  95% CI           % Female  95% CI
GPS-MSA-BALANCED      36.17         [36.07 - 36.27]  51.5      [51.3 - 51.8]
GPS-COUNTY-BALANCED   36.25         [36.16 - 36.30]  51.3      [51.1 - 51.6]
LOC-MSA-BALANCED      38.35         [38.25 - 38.44]  49.3      [49.1 - 49.6]

Table 2: Demographic statistics for each dataset

Figure 3: Difference in age probability distributions between
GPS-MSA-BALANCED and LOC-MSA-BALANCED.
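Bootstrap confidence intervals of the kind reported in Table 2 can be obtained with a percentile bootstrap, sketched below. The paper does not say which bootstrap variant was used, so the percentile method, the number of replicates, and the function names are all assumptions.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def bootstrap_ci(values, stat=mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`.

    Resamples `values` with replacement `n_boot` times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the replicated statistic.
    """
    rng = random.Random(seed)
    n = len(values)
    replicates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    k = int(round(alpha / 2 * n_boot))  # replicates trimmed from each tail
    return replicates[k], replicates[n_boot - 1 - k]
```

Here `values` would hold one number per user, for example each user's expected age under the name-based distribution, and two samples whose intervals do not overlap would be treated as distinguishable.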

4 Impact on linguistic generalizations 

Many papers use Twitter data to draw conclusions about the
relationship between language and geography. What effect do the
demographic differences identified in the previous section have
on the linguistic conclusions that emerge? We measure the
differences between the linguistic corpora obtained by each data
acquisition approach. Since the GPS-MSA-BALANCED and
GPS-COUNTY-BALANCED methods have nearly identical patterns of
usage and demographics, we focus on the difference between
GPS-MSA-BALANCED and LOC-MSA-BALANCED. These datasets differ in
age and gender, so we also directly measure the impact of these
demographic factors on the use of geographically-specific
linguistic variables.

4.1 Methods 

Discovering geographical linguistic variables We focus on
lexical variation, which is relatively easy to identify in text
corpora. Monroe et al. (2008) survey a range of alternative
statistics for finding lexical variables, demonstrating that a
regularized log-odds ratio strikes a good balance between
distinctiveness and robustness. A similar approach is
implemented in SAGE (Eisenstein et al., 2011a)², which we use
here. For each sample (GPS-MSA-BALANCED and LOC-MSA-BALANCED) we
apply SAGE to identify the twenty-five most salient lexical
items for each metropolitan area.
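To make the idea of a regularized log-odds ratio concrete, the sketch below computes z-scored log-odds with a Dirichlet prior proportional to the pooled word frequencies, in the spirit of Monroe et al. (2008). It is a stand-in for illustration and is not the SAGE model; the function name and the `prior_scale` parameter are assumptions.

```python
import math
from collections import Counter


def log_odds_z(counts_a, counts_b, prior_scale=10.0):
    """Z-scored log-odds ratio of each word in corpus A versus corpus B.

    counts_a, counts_b: {word: count}. The Dirichlet prior is proportional
    to the pooled word frequencies and sums to `prior_scale`.
    """
    background = Counter(counts_a) + Counter(counts_b)
    bg_total = sum(background.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for word in background:
        alpha = prior_scale * background[word] / bg_total
        y_a, y_b = counts_a.get(word, 0), counts_b.get(word, 0)
        # Smoothed log-odds of the word within each corpus.
        odds_a = math.log((y_a + alpha) / (n_a + prior_scale - y_a - alpha))
        odds_b = math.log((y_b + alpha) / (n_b + prior_scale - y_b - alpha))
        # Approximate variance of the difference in log-odds.
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[word] = (odds_a - odds_b) / math.sqrt(variance)
    return scores
```

Sorting the scores for one metropolitan area against the pooled remainder and keeping the top twenty-five would give a word list comparable in spirit to the per-city lists described above.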

Keyword annotation Previous research has identified two main
types of geographical lexical variables. The first are
non-standard words and spellings, such as hella and yinz, which
have been found to be very frequent in social media (Eisenstein,
2015). Other researchers have focused on the “long tail” of
entity names (Roller et al., 2012). A key question is the
relative importance of these two variable types, since this
would decide whether geo-linguistic differences are primarily
topic-based or stylistic. It is therefore important to know
whether the frequency of these two variable types depends on
properties of the sample. To test this, we take the lexical
items identified by SAGE (25 per MSA, for both the
GPS-MSA-BALANCED and LOC-MSA-BALANCED samples), and annotate
them as NONSTANDARD-WORD, ENTITY-NAME, or OTHER. Annotation for
ambiguous cases is based on the majority sense in
randomly-selected examples. Overall, we identify 24
NONSTANDARD-WORDs and 185 ENTITY-NAMEs.

Inferring author demographics As described in § 3.1.2, we can
obtain an approximate distribution over author age and gender by
linking self-reported first names with aggregate statistics from
the United States Census. To sharpen these estimates, we now
consider the text as well, building a simple latent variable
model in which both the name and the word counts are drawn from
distributions associated with the latent age and gender (Chang
et al., 2010). The model is shown in Figure 4, and involves the
following generative process:

² https://github.com/jacobeisenstein/jos-gender-2014

$a_i$            Age (bin) for author $i$
$g_i$            Gender of author $i$
$w_i$            Word counts for author $i$
$n_i$            First name of author $i$
$\pi$            Prior distribution over age bins
$\theta_{a,g}$   Word distribution for age $a$ and gender $g$
$\phi_{a,g}$     First name distribution for age $a$ and gender $g$

Figure 4: Plate diagram for latent variable model of age and
gender

For each user $i \in \{1 \ldots N\}$,

(a) draw the age, $a_i \sim \text{Categorical}(\pi)$

(b) draw the gender, $g_i \sim \text{Categorical}(0.5)$

(c) draw the author’s given name,
$n_i \sim \text{Categorical}(\phi_{a_i, g_i})$

(d) draw the word counts,
$w_i \sim \text{Multinomial}(\theta_{a_i, g_i})$,

where we elide the second parameter of the multinomial
distribution, the total word count. We use
expectation-maximization to perform inference in this model,
binning the latent age variable into four groups: 0-17, 18-29,
30-39, and 40 and above.³ Because the distribution of names
given demographics is available from the Social Security data,
we clamp the value of $\phi$ throughout the EM procedure. Other
work in the domain of demographic prediction often involves more
complex methods (Nguyen et al., 2014; Volkova and Durme, 2015),
but since it is not the focus of our research, we take a
relatively simple approach here, assuming no labeled data for
demographic attributes.
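The EM procedure for this kind of model can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the name distributions `phi` are passed in and never updated (clamped), the gender prior is fixed at uniform, the word distributions start uniform so that the first E-step is driven by names alone, and the smoothing constant and function name are hypothetical.

```python
import math
from collections import defaultdict


def em_demographics(users, phi, ages, genders=("F", "M"), n_iter=20, smooth=0.1):
    """EM for a mixture over (age bin, gender) with clamped name distributions.

    users: list of (first_name, {word: count}).
    phi:   {(age, gender): {name: probability}}, held fixed throughout.
    Returns (posteriors, pi, theta).
    """
    classes = [(a, g) for a in ages for g in genders]
    vocab = sorted({w for _, wc in users for w in wc})
    pi = {a: 1.0 / len(ages) for a in ages}
    theta = {c: {w: 1.0 / len(vocab) for w in vocab} for c in classes}
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior over (age, gender) for each user, in log space.
        posteriors = []
        for name, wc in users:
            logp = {}
            for a, g in classes:
                lp = math.log(max(pi[a], 1e-12)) - math.log(len(genders))
                lp += math.log(phi[(a, g)].get(name, 1e-12))
                lp += sum(n * math.log(theta[(a, g)][w]) for w, n in wc.items())
                logp[(a, g)] = lp
            top = max(logp.values())
            norm = sum(math.exp(v - top) for v in logp.values())
            posteriors.append({c: math.exp(v - top) / norm for c, v in logp.items()})
        # M-step: update the age prior and the smoothed word distributions.
        for a in ages:
            pi[a] = sum(q[(a, g)] for q in posteriors for g in genders) / len(users)
        for c in classes:
            expected = defaultdict(float)
            for q, (_, wc) in zip(posteriors, users):
                for w, n in wc.items():
                    expected[w] += q[c] * n
            total = sum(expected.values()) + smooth * len(vocab)
            theta[c] = {w: (expected[w] + smooth) / total for w in vocab}
    return posteriors, pi, theta
```

The multinomial coefficient is omitted from the log-likelihood because it is the same for every latent class and cancels in the posterior, which parallels the elided total word count in the generative description.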

4.2 Results 

Linguistic differences by dataset We first consider the impact
of the data acquisition technique on the lexical features
associated with each city. The keywords identified in the
GPS-MSA-BALANCED dataset feature more geographically-specific
non-standard words, which occur at a rate of
$3.9 \times 10^{-4}$ in GPS-MSA-BALANCED, versus
$2.6 \times 10^{-4}$ in LOC-MSA-BALANCED; this difference is
statistically significant ($p < .05$, $t = 3.2$).⁴ For entity
names, the difference between datasets was not significant, with
a rate of $4.0 \times 10^{-3}$ for GPS-MSA-BALANCED, and
$3.7 \times 10^{-3}$ for LOC-MSA-BALANCED. Note that these rates
include only the non-standard words and entity names detected by
SAGE as among the top 25 most distinctive for one of the ten
largest cities in the US; of course there are many other
relevant terms that are below this threshold.

In a pilot study of the GPS-COUNTY-BALANCED data, we found few
linguistic differences from GPS-MSA-BALANCED, in either the
aggregate word-group frequencies or the SAGE word lists, despite
the geographical imbalances shown in Table 1 and Figure 1.
Informal examination of specific counties shows some expected
differences: for example, Clayton County, which hosts Atlanta’s
Hartsfield-Jackson airport, includes terms related to air
travel, and other counties include mentions of local cities and
business districts. But the aggregate statistics for
underrepresented counties are not substantially different from
those of overrepresented counties, and are largely unaffected by
county-based resampling.

Figure 5: Aggregate statistics for geographically-specific
non-standard words and entity names across imputed demographic
groups, from the GPS-MSA-BALANCED sample. Panels: (a)
non-standard words; (b) entity names. The x-axis of each panel
is the age group.

³ Binning is often employed in work on text-based age prediction
(Garera and Yarowsky, 2009; Rao et al., 2010; Rosenthal and
McKeown, 2011); it enables word and name counts to be shared
over multiple ages, and avoids the complexity inherent in
regressing high-dimensional textual predictors against a
numerical variable.

⁴ We employ a paired t-test, comparing the difference in
frequency for each word across the two datasets. Since we cannot
test the complete set of entity names or non-standard words,
this quantifies whether the observed difference is robust across
the subset of the vocabulary that we have selected.
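The paired t-test used for the dataset comparison pairs each selected word's frequency in one dataset with its frequency in the other. A minimal sketch of the test statistic follows; the function name is hypothetical, and converting the statistic into a p-value would additionally require the t distribution, which is not shown.

```python
import math


def paired_t(freq_a, freq_b):
    """Paired t statistic and degrees of freedom for two aligned lists.

    freq_a[i] and freq_b[i] are the frequencies of the same word in the
    two datasets being compared.
    """
    diffs = [a - b for a, b in zip(freq_a, freq_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the per-word differences.
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n), n - 1
```

Because each word serves as its own control, the test asks whether the difference is consistent across the selected vocabulary rather than driven by a single high-frequency item.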









Age    Sex  New York                                          Dallas
0-17   F    niall, ilysm, hemmings, stalk, ily                fanuary, idol, lmbo, lowkey, jonas
0-17   M    ight, technique, kisses, lesbian, dicks           homies, daniels, oomf, teenager, brah
18-29  F    roses, castle, hmmmm, chem, sinking               socially, coma, hubby, bra, swimming
18-29  M    drunken, Manhattan, spoiler, guardians, gonna     harden, watt, astros, rockets, mavs
30-39  F    suite, nyc, colleagues, york, portugal            astros, sophia, recommendations, houston, prepping
30-39  M    mets, effectively, cruz, founder, knicks          texans, rockets, embarrassment, tcu, Mississippi
40+    F    cultural, affected, encouraged, proverb, unhappy  determine, islam, rejoice, psalm, responsibility
40+    M    renters, investors, shares, lawsuit, theaters     mph, wazers, houston, tx, Harris

Table 3: Most characteristic words for demographic subsets of
each city, as compared with the overall average word
distribution


Demographics Aggregate linguistic statistics for demographic
groups are shown in Figure 5. Men use significantly more
geographically-specific entity names than women ($p \ll .01$,
$t = 8.0$), but gender differences for geographically-specific
non-standard words are not significant ($p \approx .2$).⁵
Younger people use significantly more geographically-specific
non-standard words than older people (ages 0-29 versus 30+,
$p \ll .01$, $t = 7.8$), and older people mention significantly
more geographically-specific entity names ($p \ll .01$,
$t = 5.1$). Of particular interest is the intersection of age
and gender: the use of geographically-specific non-standard
words decreases with age much more profoundly for men than for
women; conversely, the frequency of mentioning
geographically-specific entity names increases dramatically with
age for men, but to a much lesser extent for women. The
observation that high-level patterns of geographically-oriented
language are more age-dependent for men than for women suggests
an intriguing site for future research on the intersectional
construction of linguistic identity.

For a more detailed view, we apply SAGE to identify the most
salient lexical items for each MSA, subgrouped by age and
gender. Table 3 shows word lists for New York (the largest MSA)
and Dallas (the 5th-largest MSA), using the GPS-MSA-BALANCED
sample. Non-standard words tend to be used by the youngest
authors: ilysm (‘I love you so much’), ight (‘alright’), oomf
(‘one of my followers’). Older authors write more about local
entities (manhattan, nyc, houston), with men focusing on
sports-related entities (harden, watt, astros, mets, texans),
and women above the age of 40 emphasizing religiously-oriented
terms (proverb, islam, rejoice, psalm).

⁵ But see Bamman et al. (2014) for a much more detailed
discussion of gender and standardness.

5 Impact on text-based geolocation 

A major application of geotagged social media is to predict the
geolocation of individuals based on their text (Eisenstein et
al., 2010; Cheng et al., 2010; Wing and Baldridge, 2011; Hong et
al., 2012; Han et al., 2014). Text-based geolocation has obvious
commercial implications for location-based marketing and opinion
analysis; it is also potentially useful for researchers who want
to measure geographical phenomena in social media, and wish to
access a larger set of individuals than those who provide their
locations explicitly.

Previous research has obtained impressive accuracies for
text-based geolocation: for example, Hong et al. (2012) report a
median error of 120 km, which is roughly the distance from Los
Angeles to San Diego, in a prediction space over the entire
continental United States. These accuracies are computed on test
sets that were acquired through the same procedures as the
training data, so if the acquisition procedures have geographic
and demographic biases, then the resulting accuracy estimates
will be biased too. Consequently, they may be overly optimistic
(or pessimistic!) for some types of authors. In this section, we
explore where these text-based geolocation methods are most and
least accurate.

5.1 Methods 

Our data is drawn from the ten largest metropolitan areas in the
United States, and we formulate text-based geolocation as a
ten-way classification problem, similar to Han et al. (2014).⁶
Using our user-balanced samples, we apply ten-fold cross
validation, and tune the regularization parameter on a
development fold, using the vocabulary of the sample as
features.

⁶ Many previous papers have attempted to identify the precise
latitude and longitude coordinates of individual authors, but
obtaining high accuracy on this task involves much more

5.2 Results 

Many author-attribute prediction tasks become substantially
easier as more data is available (Burger et al., 2011), and
text-based geolocation is no exception. Since GPS-MSA-BALANCED
and LOC-MSA-BALANCED have very different usage rates (Figure 2),
perceived differences in accuracy may be purely attributable to
the amount of data available per user, rather than to users in
one group being inherently harder to classify than another. For
this reason, we bin users by the number of messages in our
sample of their timeline, and report results separately for each
bin. All error bars represent 95% confidence intervals.
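Reporting accuracy separately for each usage bin can be sketched as below. The bin edges, the record layout, and the function name are hypothetical, and the interval uses a normal approximation to the binomial, which is an assumption since the text does not state how the error bars were computed.

```python
import bisect
import math
from collections import defaultdict


def accuracy_by_bin(records, edges):
    """Classification accuracy with a 95% interval for each usage bin.

    records: iterable of (n_messages, correct) pairs, one per user.
    edges:   ascending upper bounds of the bins; users above the last
             edge fall into an overflow bin with index len(edges).
    Returns {bin_index: (accuracy, lower, upper)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for n_messages, correct in records:
        b = bisect.bisect_left(edges, n_messages)
        totals[b] += 1
        hits[b] += int(correct)
    result = {}
    for b in sorted(totals):
        acc = hits[b] / totals[b]
        # Normal approximation to the binomial standard error.
        half = 1.96 * math.sqrt(acc * (1.0 - acc) / totals[b])
        result[b] = (acc, max(0.0, acc - half), min(1.0, acc + half))
    return result
```

Comparing two samples bin by bin, rather than overall, separates the effect of how much each user writes from any remaining difference between the samples.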

GPS versus location As seen in Figure 6a, there is little
difference in accuracy across sampling techniques: the
location-based sample is slightly easier to geolocate at each
usage bin, but the difference is not statistically significant.
However, due to the higher average usage rate in
GPS-MSA-BALANCED (see Figure 2), the overall accuracy for a
sample of users will appear to be higher on this data.

Demographics Next, we measure classification accuracy by gender
and age, using the posterior distribution from the
expectation-maximization algorithm to predict the gender of each
user (broadly similar results are obtained by using the prior
distribution). For this experiment, we focus on the
GPS-MSA-BALANCED sample. As shown in Figure 6b, text-based
geolocation is consistently more accurate for male authors,
across almost the entire spectrum of usage rates. As shown in
Figure 6c, older users also tend to be easier to geolocate: at
each usage level, the highest accuracy goes to one of the two
older groups, and the difference is significant in almost every
case. As discussed in § 4, older male users tend to mention many
entities, particularly sports-related terms; these terms are
apparently more predictive than the non-standard spellings and
slang favored by younger authors.

complex methods, such as latent variable models (Eisenstein et
al., 2010; Hong et al., 2012), or multilevel grid structures
(Cheng et al., 2010; Roller et al., 2012). Tuning such models
can be challenging, and the resulting accuracies might be
affected by initial conditions or hyperparameters. We therefore
focus on classification, employing the familiar and
well-understood method of logistic regression.

6 Related Work 

Several researchers have studied how adoption of Internet
technology varies with factors such as socioeconomic status,
age, gender, and living conditions (Zillien and Hargittai,
2009). Hargittai and Litt (2011) use a longitudinal survey
methodology to compare the effects of gender, race, and topics
of interest on Twitter usage among young adults. Geographic
variation in Twitter adoption has been considered both
internationally (Kulshrestha et al., 2012) and within the United
States, using both the Twitter location field (Mislove et al.,
2011) and per-message GPS coordinates (Hecht and Stephens,
2014). Aggregate demographic statistics of Twitter users’
geographic census blocks were computed by O’Connor et al. (2010)
and Eisenstein et al. (2011b); Malik et al. (2015) use census
demographics in a spatial error model. These papers draw similar
conclusions, showing that the distribution of geotagged tweets
over the US population is not random, and that higher usage is
correlated with urban areas, high income, more ethnic
minorities, and more young people. However, this prior work did
not consider the biases introduced by relying on geotagged
messages, nor the consequences for geo-linguistic analysis.

Twitter has often been used to study the geographical
distribution of linguistic information, and of particular
relevance are Twitter-based studies of regional dialect
differences (Eisenstein et al., 2010; Doyle, 2014; Gonçalves and
Sánchez, 2014; Eisenstein, 2015) and text-based geolocation
(Cheng et al., 2010; Hong et al., 2012; Han et al., 2014). This
prior work rarely considers the impact of demographic confounds,
or of the geographical biases mentioned in § 3. Recent research
shows that accuracies of core language technology tasks such as
part-of-speech tagging are correlated with author demographics
such as author age (Hovy and Søgaard, 2015); our results on
location prediction are in accord with these findings. Hovy
(2015) shows that including author demographics can improve text
classification; a similar approach might improve text-based
geolocation as well.

We address the question about the impact of geographical biases
and demographic confounds by measuring differences between three
sampling techniques, in both language use and in the accuracy of
text-based geolocation. Recent unpublished work proposes
reweighting Twitter data to correct biases in political analysis
(Choy et al., 2012) and public health (Culotta, 2014). Our
results suggest that the linguistic differences between
user-supplied profile locations and per-message geotags are more
significant, and that accounting for the geographical biases
among geotagged messages is not sufficient to offer a
representative sample of Twitter users.

Figure 6 (a): Classification accuracy by sampling

7 Discussion 

Geotagged Twitter data offers an invaluable resource for
studying the interaction of language and geography, and is
helping to usher in a new generation of location-aware language
technology. This makes critical investigation of the nature of
this data source particularly important. This paper uncovers
demographic confounds in the linguistic analysis of geo-located
Twitter data, but is limited to demographics that can be readily
induced from given names. A key task for future work is to
quantify the representativeness of geotagged Twitter data with
respect to factors such as race and socioeconomic status, while
holding geography constant. However, these features may be more
difficult to impute from names alone. Another crucial task is to
expand this investigation beyond the United States, as the
varying patterns of use for social media across countries (Pew
Research Center, 2012) imply that the findings here cannot be
expected to generalize to every international context.